Abstract

Support Vector Machines (SVMs) perform pattern recognition between two point 
classes by finding a decision surface determined by certain points of the training set, 
termed Support Vectors (SV). This surface, which in some feature space of possibly 
infinite dimension can be regarded as a hyperplane, is obtained from the solution of 
a problem of quadratic programming that depends on a regularization parameter. 
In this paper we study some mathematical properties of support vectors and show 
that the decision surface can be written as the sum of two orthogonal terms, the first
depending only on the margin vectors (which are SVs lying on the margin), the second 
proportional to the regularization parameter. For almost all values of the parameter, 
this enables us to predict how the decision surface varies for small parameter changes. 
In the special but important case of feature space of finite dimension m, we also show 
that there are at most m + 1 margin vectors and observe that m + 1 SVs are usually 
sufficient to fully determine the decision surface. For relatively small m this latter
result leads to a substantial reduction in the number of SVs.
1 Introduction

Support Vector Machines (SVMs) have been recently introduced as a new technique for solving 
pattern recognition problems [Cortes and Vapnik 1995, Blanz et al. 1996, Scholkopf et al. 1996, 
Osuna, Freund and Girosi 1997]. According to the theory of SVMs [Vapnik 1982, Vapnik 1995],
while traditional techniques for pattern recognition are based on the minimization of the empirical
risk (that is, on the attempt to optimize the performance on the training set), SVMs minimize
the structural risk (that is, the probability of misclassifying yet-to-be-seen patterns for a fixed but
unknown probability distribution of the data). This new induction principle, which is equivalent to
minimizing an upper bound on the generalization error, relies on the theory of uniform convergence
in probability [Vapnik 1982]. What makes SVMs attractive is (a) the ability to condense the 
information contained in the training set, and (b) the use of families of decision surfaces of 
relatively low VC-dimension [Vapnik and Chervonenkis 1971]. 

In the linear, separable case the key idea of an SVM can be explained in plain words. Given a
training set S which contains points of either of two classes, an SVM separates the classes through
a hyperplane determined by certain points of S, termed support vectors. In the separable case, 
this hyperplane maximizes the margin, or twice the minimum distance of either class from the 
hyperplane, and all the support vectors lie at the same minimum distance from the hyperplane 
(and are thus termed margin vectors). In real cases, the two classes may not be separable and 
both the hyperplane and the support vectors are obtained from the solution of a problem of 
constrained optimization. The solution is a trade-off between the largest margin and the lowest
number of errors, a trade-off controlled by a regularization parameter.

The aim of this paper is to gain a better understanding of the nature of support vectors, and how 
the regularization parameter determines the decision surface, in both the linear and nonlinear 
case. We thus investigate some mathematical properties of support vectors and characterize the 
dependence of the decision surface on the changes of the regularization parameter. The analysis 
is first carried out in the simpler linear case and then extended to include nonlinear decision
surfaces. 

The paper is organized as follows. We first review the theory of SVMs in section 2 and then
present our analysis in section 3. Finally, we summarize the conclusions of our work in section 4. 

2 Theoretical overview 

In this section we recall the basics of the theory of SVM [Vapnik 1995, Cortes and Vapnik 1995] 
in both the linear and nonlinear case. We start with the simple case of linearly separable sets. 

2.1 Optimal separating hyperplane 

In what follows we assume we are given a set S of points $\mathbf{x}_i \in \mathbb{R}^n$ with i = 1, 2, ..., N. Each
point $\mathbf{x}_i$ belongs to either of two classes and thus is given a label $y_i \in \{-1, 1\}$. The goal is to
establish the equation of a hyperplane that divides S leaving all the points of the same class on 
the same side while maximizing the minimum distance between either of the two classes and the 
hyperplane. To this purpose we need some preliminary definitions. 



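Definition 1. The set S is linearly separable if there exist $\mathbf{w} \in \mathbb{R}^n$ and $b \in \mathbb{R}$ such that

$$\begin{aligned} \mathbf{w} \cdot \mathbf{x}_i + b &\ge 1 && \text{if } y_i = 1, \\ \mathbf{w} \cdot \mathbf{x}_i + b &\le -1 && \text{if } y_i = -1. \end{aligned} \qquad (1)$$

In more compact notation, the two inequalities (1) can be rewritten

$$y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1, \qquad (2)$$

for i = 1, 2, ..., N. The pair (w, b) defines a hyperplane of equation

$$\mathbf{w} \cdot \mathbf{x} + b = 0$$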

named separating hyperplane (see figure 1(a)). If we denote with $w$ the norm of $\mathbf{w}$, the signed
distance $d_i$ of a point $\mathbf{x}_i$ from the separating hyperplane (w, b) is given by

$$d_i = \frac{\mathbf{w} \cdot \mathbf{x}_i + b}{w}. \qquad (3)$$

Combining inequality (2) and equation (3), for all $\mathbf{x}_i \in S$ we have

$$y_i d_i \ge \frac{1}{w}. \qquad (4)$$

Therefore, 1/w is the lower bound on the distance between the points $\mathbf{x}_i$ and the separating
hyperplane (w, b).





Figure 1: Separating hyperplane and optimal separating hyperplane. Both solid lines in (a) and
(b) separate the two identical sets of open circles and triangles, but the solid line in (b) leaves
the closest points (the filled circles and triangle) at the maximum distance. The dashed lines in
(b) identify the margin.

One might ask why not simply rewrite inequality (2) as

$$y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \ge 0.$$

The purpose of the "1" in the right hand side of inequality (2) is to establish a one-to-one
correspondence between separating hyperplanes and their parametric representation. This is
done through the notion of canonical representation of a separating hyperplane^1.

Definition 2. Given a separating hyperplane (w, b) for the linearly separable set S, the canonical 
representation of the separating hyperplane is obtained by rescaling the pair (w, b) into the pair 
(w', b') in such a way that the distance of the closest point equals 1/w'.

Through this definition we have that

$$\min_{\mathbf{x}_i \in S} \{ y_i(\mathbf{w}' \cdot \mathbf{x}_i + b') \} = 1.$$

Consequently, for a separating hyperplane in the canonical representation, the bound in inequality (4)
is tight. In what follows we will assume that a separating hyperplane is always given the
canonical representation and thus write (w, b) instead of (w', b'). We are now in a position to
define the notion of optimal separating hyperplane. 

Definition 3. Given a linearly separable set S, the optimal separating hyperplane (OSH) is the 
separating hyperplane which maximizes the distance of the closest point of S. 

Since the distance of the closest point equals 1/w, the OSH can be regarded as the solution of 
the problem of maximizing 1/w subject to the constraint (2), or 

Problem P1

$$\begin{aligned} \text{Minimize} \quad & \tfrac{1}{2}\, \mathbf{w} \cdot \mathbf{w} \\ \text{subject to} \quad & y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1, \quad i = 1, 2, \ldots, N \end{aligned}$$

Two comments are in order. First, if the pair (w, b) solves P1, then for at least one $\mathbf{x}_i \in S$ we
have $y_i(\mathbf{w} \cdot \mathbf{x}_i + b) = 1$. In particular, this implies that the solution of P1 is always a separating
hyperplane in the canonical representation. Second, the parameter b enters in the constraints
but not in the function to be minimized.

The quantity 2/w, which measures the distance between the two classes in the direction of w, is
named margin. Hence, the OSH can also be seen as a separating hyperplane which maximizes
the margin (see figure 1(b)). We now study the properties of the solution of the problem P1.
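
As a small illustration of Definition 2 and of the margin 2/w, the following Python sketch rescales a separating hyperplane into its canonical representation. The helper names (`dot`, `canonical`, `margin`) and the four-point training set are assumptions introduced only for this example; they are not part of the original analysis.

```python
import math

def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def canonical(w, b, points, labels):
    """Rescale (w, b) so that min_i y_i (w . x_i + b) = 1 (Definition 2)."""
    closest = min(y * (dot(w, x) + b) for x, y in zip(points, labels))
    if closest <= 0:
        raise ValueError("(w, b) does not separate the set")
    return [wi / closest for wi in w], b / closest

def margin(w):
    """Margin 2 / ||w|| of a hyperplane in canonical representation."""
    return 2.0 / math.sqrt(dot(w, w))

# Hypothetical toy set: two points per class on the line x2 = x1.
points = [(1.0, 1.0), (3.0, 3.0), (-1.0, -1.0), (-2.0, -2.0)]
labels = [1, 1, -1, -1]

# Any positive multiple of (w, b) describes the same hyperplane;
# the canonical pair is the one for which the closest point gives exactly 1.
w_can, b_can = canonical([2.0, 2.0], 0.0, points, labels)
print(w_can, b_can, margin(w_can))
```

For this set the margin equals the distance between the closest points of the two classes, (1, 1) and (-1, -1), measured along w.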

2.2 Support vectors 

Problem PI can be solved by means of the classical method of Lagrange multipliers [Bazaraa and Shetty li 

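If we denote with $\boldsymbol{\alpha} = (\alpha_1, \alpha_2, \ldots, \alpha_N)$ the N nonnegative Lagrange multipliers associated with
the constraints (2), the solution to problem P1 is equivalent to determining the saddle point of
the function

$$L = \frac{1}{2}\, \mathbf{w} \cdot \mathbf{w} - \sum_{i=1}^{N} \alpha_i \left\{ y_i(\mathbf{w} \cdot \mathbf{x}_i + b) - 1 \right\}, \qquad (5)$$

with $L = L(\mathbf{w}, b, \boldsymbol{\alpha})$. At the saddle point, L has a minimum for $\mathbf{w} = \bar{\mathbf{w}}$ and $b = \bar{b}$ and a
maximum for $\boldsymbol{\alpha} = \bar{\boldsymbol{\alpha}}$, and thus we can write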

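$$\frac{\partial L}{\partial b} = -\sum_{i=1}^{N} y_i \alpha_i = 0, \qquad (6)$$

$$\frac{\partial L}{\partial \mathbf{w}} = \mathbf{w} - \sum_{i=1}^{N} \alpha_i y_i \mathbf{x}_i = \mathbf{0}, \qquad (7)$$

with

$$\frac{\partial L}{\partial \mathbf{w}} = \left( \frac{\partial L}{\partial w_1}, \frac{\partial L}{\partial w_2}, \ldots, \frac{\partial L}{\partial w_n} \right).$$

1 This intermediate step toward the derivation of optimal separating hyperplanes is slightly different from the
derivation originally developed in [Cortes and Vapnik 1995].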

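Substituting equations (6) and (7) into the right hand side of (5), we see that problem P1 reduces
to the maximization of the function

$$\sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{N} \alpha_i \alpha_j y_i y_j \, \mathbf{x}_i \cdot \mathbf{x}_j$$

subject to the constraint (6) with $\boldsymbol{\alpha} \ge 0$^2. This new problem is called dual problem and can be
formulated as

Problem P2

$$\begin{aligned} \text{Maximize} \quad & -\tfrac{1}{2}\, \boldsymbol{\alpha} \cdot D \boldsymbol{\alpha} + \sum \alpha_i \\ \text{subject to} \quad & \sum y_i \alpha_i = 0 \\ & \boldsymbol{\alpha} \ge 0, \end{aligned}$$

where both sums are for i = 1, 2, ..., N, and D is an N × N matrix such that

$$D_{ij} = y_i y_j \, \mathbf{x}_i \cdot \mathbf{x}_j. \qquad (8)$$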

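As for the pair $(\bar{\mathbf{w}}, \bar{b})$, from equation (7) it follows that

$$\bar{\mathbf{w}} = \sum_{i=1}^{N} \bar{\alpha}_i y_i \mathbf{x}_i, \qquad (9)$$

while $\bar{b}$ can be determined from the Kuhn-Tucker conditions

$$\bar{\alpha}_i \left( y_i(\bar{\mathbf{w}} \cdot \mathbf{x}_i + \bar{b}) - 1 \right) = 0, \quad i = 1, 2, \ldots, N. \qquad (10)$$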



Note that the only $\bar{\alpha}_i$ that can be nonzero in equation (10) are those for which the constraints (2)
are satisfied with the equality sign. The corresponding points $\mathbf{x}_i$, termed support vectors, are the
points of S closest to the OSH (see figure 1(b)).

Given a support vector $\mathbf{x}_j$, the parameter $\bar{b}$ can be obtained from the corresponding Kuhn-Tucker
condition as

$$\bar{b} = y_j - \bar{\mathbf{w}} \cdot \mathbf{x}_j.$$

The problem of classifying a new data point x is now simply solved by computing

$$\mathrm{sign}\left( \bar{\mathbf{w}} \cdot \mathbf{x} + \bar{b} \right). \qquad (11)$$



In conclusion, the support vectors condense all the information contained in the training set S 
which is needed to classify new data points. 
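
The passage from the multipliers to the classifier can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three training points and the multipliers `alpha` are a hypothetical toy configuration (the multipliers satisfy the Kuhn-Tucker conditions (10) for it), and the function names are introduced here for convenience.

```python
def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def primal_from_dual(alpha, points, labels):
    """Recover (w, b) from the multipliers: equation (9) plus one Kuhn-Tucker condition."""
    n = len(points[0])
    w = [sum(a * y * x[k] for a, y, x in zip(alpha, labels, points))
         for k in range(n)]
    # Any point with a nonzero multiplier is a support vector: b = y_j - w . x_j.
    j = next(i for i, a in enumerate(alpha) if a > 0)
    b = labels[j] - dot(w, points[j])
    return w, b

def classify(w, b, x):
    """Equation (11); a point lying exactly on the hyperplane is assigned to class +1 here."""
    return 1 if dot(w, x) + b >= 0 else -1

# Hypothetical toy set: the third point is not a support vector (alpha = 0).
points = [(1.0, 1.0), (-1.0, -1.0), (3.0, 3.0)]
labels = [1, -1, 1]
alpha = [0.25, 0.25, 0.0]   # assumed solution of the dual problem for this set

w, b = primal_from_dual(alpha, points, labels)
print(w, b, classify(w, b, (2.0, 0.0)), classify(w, b, (-1.0, 0.0)))
```

Removing the third point from the training set leaves w and b unchanged, which is the sense in which the support vectors condense the information in S.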



2 In what follows $\boldsymbol{\alpha} \ge 0$ means $\alpha_i \ge 0$ for every component $\alpha_i$ of any vector $\boldsymbol{\alpha}$.



2.3 Linearly nonseparable case 

If the set S is not linearly separable or one simply ignores whether or not the set S is linearly 
separable, the problem of searching for an OSH is meaningless (there may be no separating 
hyperplane to start with). Fortunately, the previous analysis can be generalized by introducing 
N nonnegative variables $\boldsymbol{\xi} = (\xi_1, \xi_2, \ldots, \xi_N)$ such that

$$y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1 - \xi_i, \quad i = 1, 2, \ldots, N. \qquad (12)$$

If the point $\mathbf{x}_i$ satisfies inequality (2), then $\xi_i$ is null and (12) reduces to (2). Instead, if the point
$\mathbf{x}_i$ does not satisfy inequality (2), the term $-\xi_i$ is added to the right hand side of (2) to obtain
inequality (12). The generalized OSH is then regarded as the solution to

Problem P3

$$\begin{aligned} \text{Minimize} \quad & \tfrac{1}{2}\, \mathbf{w} \cdot \mathbf{w} + C \sum \xi_i \\ \text{subject to} \quad & y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1 - \xi_i, \quad i = 1, 2, \ldots, N \\ & \boldsymbol{\xi} \ge 0. \end{aligned}$$

The term $C \sum \xi_i$, where the sum is for i = 1, 2, ..., N, can be thought of as some measure of the
amount of misclassification. Note that this term leads to a more robust solution, in the statistical
sense, than the intuitively more appealing term $C \sum \xi_i^2$. In other words, the term $C \sum \xi_i$ makes
the OSH less sensitive to the presence of outliers in the training set. The parameter C can be
regarded as a regularization parameter. The OSH tends to maximize the minimum distance 1/w
for small C, and minimize the number of misclassified points for large C. For intermediate values
of C the solution of problem P3 trades errors for a larger margin. The behavior of the OSH as 
a function of C will be studied in detail in the next section. 
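
To make the role of the slack variables concrete, the sketch below evaluates, for a given pair (w, b), the smallest slack values compatible with inequality (12) and the corresponding value of the function minimized in problem P3. The four-point set, the pair (w, b), and the value C = 2 are hypothetical choices made for illustration only.

```python
def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def slacks(w, b, points, labels):
    """Smallest nonnegative xi_i satisfying y_i (w . x_i + b) >= 1 - xi_i."""
    return [max(0.0, 1.0 - y * (dot(w, x) + b)) for x, y in zip(points, labels)]

def objective(w, b, points, labels, C):
    """Value of (1/2) w . w + C * sum_i xi_i for the slacks defined above."""
    return 0.5 * dot(w, w) + C * sum(slacks(w, b, points, labels))

# Hypothetical, non-separable toy set in the plane.
points = [(1.0, 0.0), (0.5, 0.0), (-0.5, 0.5), (-1.0, 0.0)]
labels = [1, -1, 1, -1]
w, b, C = [1.0, 1.0], 0.0, 2.0

xi = slacks(w, b, points, labels)
print(xi, objective(w, b, points, labels, C))
```

Here the first and last points lie on the margin (zero slack), the second point is misclassified (slack larger than 1), and the third lies on the hyperplane itself (slack equal to 1).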

In analogy with what was done for the separable case, problem P3 can be transformed into the 
dual 

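Problem P4

$$\begin{aligned} \text{Maximize} \quad & -\tfrac{1}{2}\, \boldsymbol{\alpha} \cdot D \boldsymbol{\alpha} + \sum \alpha_i \\ \text{subject to} \quad & \sum y_i \alpha_i = 0 \\ & 0 \le \alpha_i \le C, \quad i = 1, 2, \ldots, N \end{aligned}$$

with D the same N × N matrix of the separable case. Note that the dimension of P4 is given by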
the size of the training set, while the dimension of the input space gives the rank of D. From the 
constraints of problem P4 it follows that if C is sufficiently large and the set S linearly separable, 
problem P4 reduces to P2. 
As for the pair $(\bar{\mathbf{w}}, \bar{b})$, it is easy to find that

$$\bar{\mathbf{w}} = \sum_{i=1}^{N} \bar{\alpha}_i y_i \mathbf{x}_i,$$

while $\bar{b}$ can again be determined from $\bar{\boldsymbol{\alpha}}$, solution of the dual problem P4, and from the new
Kuhn-Tucker conditions

$$\bar{\alpha}_i \left( y_i(\bar{\mathbf{w}} \cdot \mathbf{x}_i + \bar{b}) - 1 + \bar{\xi}_i \right) = 0 \qquad (13)$$

$$(C - \bar{\alpha}_i)\, \bar{\xi}_i = 0 \qquad (14)$$

where the $\bar{\xi}_i$ are the values of the $\xi_i$ at the saddle point. Similarly to the separable case, the
points $\mathbf{x}_i$ for which $\bar{\alpha}_i > 0$ are termed support vectors. The main difference is that here we have
to distinguish between the support vectors for which $\bar{\alpha}_i < C$ and those for which $\bar{\alpha}_i = C$. In the
first case, from condition (14) it follows that $\bar{\xi}_i = 0$, and hence, from condition (13), that the
support vectors lie at a distance $1/\bar{w}$ from the OSH. These support vectors are termed margin
vectors. The support vectors for which $\bar{\alpha}_i = C$, instead, are misclassified points (if $\bar{\xi}_i > 1$), points
correctly classified but closer than $1/\bar{w}$ from the OSH (if $0 < \bar{\xi}_i \le 1$), or, in some degenerate
cases, even points lying on the margin (if $\bar{\xi}_i = 0$). In any event, we refer to all the support
vectors for which $\bar{\alpha}_i = C$ as errors. An example of generalized OSH with the relative margin
vectors and errors is shown in figure 2. All the points that are not support vectors are correctly
classified and lie outside the margin strip.
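
The distinction between margin vectors, errors, and the remaining points depends only on the value of each multiplier relative to 0 and C, and can be sketched as follows. The function name, the tolerance `tol`, and the multiplier values are assumptions made for this example; in practice the tolerance would depend on the accuracy of the solver used for problem P4.

```python
def partition(alpha, C, tol=1e-9):
    """Split point indexes into margin vectors (0 < alpha < C),
    errors (alpha = C) and non-support vectors (alpha = 0)."""
    margin_vectors, errors, others = [], [], []
    for i, a in enumerate(alpha):
        if a <= tol:
            others.append(i)
        elif a >= C - tol:
            errors.append(i)
        else:
            margin_vectors.append(i)
    return margin_vectors, errors, others

# Hypothetical multipliers returned by a solver for C = 2.
alpha = [1.5, 2.0, 2.0, 1.5, 0.0]
print(partition(alpha, C=2.0))
```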







Figure 2: Generalized optimal separating hyperplane. The two sets of circles and triangles are 
not linearly separable. The solid line is the optimal separating hyperplane, the filled circles and 
triangles the support vectors (the margin vectors are shown in black, the errors in gray). 

We now conclude this section by discussing the extension of the theory to the nonlinear case. 

2.4 Nonlinear kernels 

In most cases, linear separation in input space is too restrictive a hypothesis to be of practical
use. Fortunately, the theory can be extended to nonlinear separating surfaces by mapping the
input points into feature points and looking for the OSH in the corresponding feature space 
[Cortes and Vapnik 1995]. 

If $\mathbf{x} \in \mathbb{R}^n$ is an input point, we let $\varphi(\mathbf{x})$ be the corresponding feature point with $\varphi$ a mapping
from $\mathbb{R}^n$ to a certain space Z (typically a Hilbert space of finite or infinite dimension). In both
cases we denote with $\varphi_i$ the components of $\varphi$. Clearly, to an OSH in Z corresponds a nonlinear
separating surface in input space.

At first sight it might seem that this nonlinear surface cannot be determined unless the mapping
$\varphi$ is completely known. However, from the formulation of problem P4 and the classification
stage of equation (11), it follows that $\varphi$ enters only in the dot product between feature points,
since

$$D_{ij} = y_i y_j \, \varphi(\mathbf{x}_i) \cdot \varphi(\mathbf{x}_j),$$

and

$$\bar{\mathbf{w}} \cdot \varphi(\mathbf{x}) + \bar{b} = \sum \bar{\alpha}_i y_i \, \varphi(\mathbf{x}_i) \cdot \varphi(\mathbf{x}) + \bar{b}.$$

Consequently, if we find an expression for the dot product in feature space which uses the points
in input space only, that is

$$\varphi(\mathbf{x}_i) \cdot \varphi(\mathbf{x}_j) = K(\mathbf{x}_i, \mathbf{x}_j), \qquad (15)$$

full knowledge of $\varphi$ is not necessary. The symmetric function K in equation (15) is called
kernel. The nonlinear separating surface can be found as the solution of problem P4 with
$D_{ij} = y_i y_j K(\mathbf{x}_i, \mathbf{x}_j)$, while the classification stage reduces to computing

$$\mathrm{sign}\left( \sum \bar{\alpha}_i y_i K(\mathbf{x}_i, \mathbf{x}) + \bar{b} \right).$$
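
The classification stage in kernel form can be sketched as below. The kernel functions, the toy points, and the multipliers are assumptions introduced for illustration: the multipliers solve the dual problem only for the linear kernel on this set, and are reused with the Gaussian kernel (with σ = 1) merely to show that the same code path applies.

```python
import math

def linear_kernel(x, z):
    """Plain dot product in input space."""
    return sum(a * b for a, b in zip(x, z))

def gaussian_kernel(x, z, sigma=1.0):
    """exp(-||x - z||^2 / (2 sigma^2)); sigma = 1 is an arbitrary choice."""
    sq = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq / (2.0 * sigma ** 2))

def decision(x, alpha, labels, points, b, kernel):
    """sum_i alpha_i y_i K(x_i, x) + b, the argument of the sign function."""
    return sum(a * y * kernel(xi, x)
               for a, y, xi in zip(alpha, labels, points)) + b

def classify(x, alpha, labels, points, b, kernel):
    """Sign of the decision value; zero is assigned to class +1 here."""
    return 1 if decision(x, alpha, labels, points, b, kernel) >= 0 else -1

# Hypothetical toy configuration (multipliers valid for the linear kernel).
points = [(1.0, 1.0), (-1.0, -1.0), (3.0, 3.0)]
labels = [1, -1, 1]
alpha = [0.25, 0.25, 0.0]
b = 0.0

print(decision((2.0, 0.0), alpha, labels, points, b, linear_kernel))
print(classify((1.0, 1.0), alpha, labels, points, b, gaussian_kernel))
```

Only the points with nonzero multipliers contribute to the sum, so the cost of classifying a new point grows with the number of support vectors, not with N.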

Therefore, the extension of the theory to the nonlinear case is reduced to finding kernels which 
identify certain families of decision surfaces and can be written as in equation (15). A useful 
criterion for deciding whether a kernel can be written as in equation (15) is given by Mercer's 
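theorem [Courant and Hilbert 1981, Cortes and Vapnik 1995]: a kernel $K(\mathbf{x}, \mathbf{y})$, with $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$,
is a dot product in some feature space, or $K(\mathbf{x}, \mathbf{y}) = \varphi(\mathbf{x}) \cdot \varphi(\mathbf{y})$, if and only if

$$K(\mathbf{x}, \mathbf{y}) = K(\mathbf{y}, \mathbf{x}) \quad \text{and} \quad \int\!\!\int K(\mathbf{x}, \mathbf{y}) f(\mathbf{x}) f(\mathbf{y}) \, d\mathbf{x} \, d\mathbf{y} \ge 0, \quad \forall f \in L_2.$$

Given such a kernel K, a possible set of functions $\varphi = (\varphi_1, \varphi_2, \ldots)$ satisfying equation (15) can
be determined from the eigenfunctions $\hat{\varphi}_i$ solution of the eigenvalue problem

$$\int K(\mathbf{x}, \mathbf{y}) \hat{\varphi}_i(\mathbf{x}) \, d\mathbf{x} = \lambda_i \hat{\varphi}_i(\mathbf{y}), \qquad (16)$$

with $\varphi_i = \sqrt{\lambda_i}\, \hat{\varphi}_i$. If the set of eigenfunctions $\hat{\varphi}_i$ is finite, the kernel K is said to be finite and can
be rewritten as

$$K(\mathbf{x}, \mathbf{y}) = \sum \lambda_i \hat{\varphi}_i(\mathbf{x}) \hat{\varphi}_i(\mathbf{y}), \qquad (17)$$

where the sum ranges over the set of eigenfunctions. In the general case, the set $\varphi$ is infinite,
the kernel is said to be infinite, and the sum in equation (17) becomes a series or an integral.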
We now give two simple examples of kernels. The first is the polynomial kernel

$$K(\mathbf{x}, \mathbf{y}) = (1 + \mathbf{x} \cdot \mathbf{y})^d, \quad \mathbf{x}, \mathbf{y} \in [-a, a]^n.$$

It can easily be verified that the polynomial kernel satisfies Mercer's theorem and is finite. The
separating surface in input space is a polynomial surface of degree d. In this case a mapping $\varphi$
can be determined directly from the definition of K. In the particular case n = 2 and d = 2, for
example, if $\mathbf{x} = (x_1, x_2)$ we can write

$$\varphi(\mathbf{x}) = \left( 1, \sqrt{2}\, x_1, \sqrt{2}\, x_2, x_1^2, x_2^2, \sqrt{2}\, x_1 x_2 \right).$$
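
The identity between the polynomial kernel and the dot product of the explicit feature points for n = 2 and d = 2 can be checked numerically. The sketch below is only a sanity check on two arbitrarily chosen input points; the function names are introduced here for illustration.

```python
import math

def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def poly_kernel(x, z, d=2):
    """Polynomial kernel (1 + x . z)^d."""
    return (1.0 + dot(x, z)) ** d

def phi(x):
    """Explicit feature map for n = 2 and d = 2 (six components)."""
    x1, x2 = x
    s = math.sqrt(2.0)
    return (1.0, s * x1, s * x2, x1 * x1, x2 * x2, s * x1 * x2)

# Two arbitrary input points.
x, z = (0.5, -1.0), (2.0, 3.0)
print(poly_kernel(x, z), dot(phi(x), phi(z)))
```

Both printed values agree up to rounding, which shows that the feature space of this kernel has dimension m = 6 (counting the constant component).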
The second example is the Gaussian kernel

$$K(\mathbf{x}, \mathbf{y}) = \exp\left( -\frac{\|\mathbf{x} - \mathbf{y}\|^2}{2\sigma^2} \right),$$

for some $\sigma \in \mathbb{R}$. The Gaussian kernel clearly satisfies Mercer's theorem, but is infinite because
equation (16) has a continuum of eigenvalues. It is easy to verify that in this case the eigenvalues
are given by the normalized Fourier Transform of the Gaussian, $\sqrt{2\pi}\,\sigma \exp(-\|\mathbf{s}\|^2 \sigma^2 / 2)$, with
$\exp(i\, \mathbf{x} \cdot \mathbf{s})$ as corresponding eigenfunctions. The separating surface in input space is a weighted
sum of Gaussians centered on the support vectors.

We are now fully equipped to discuss some mathematical properties of the solution of problem 

P4. 



3 Mathematical properties 

The goal is to study the dependence of the OSH on the parameter C. We first deal with the
linear case and then extend the analysis to nonlinear kernels. 

3.1 Lagrange multiplier of a margin vector 

We start by establishing a simple but important result on the Lagrange multipliers of the margin 
vectors. We want to show that the Lagrange multiplier associated with a margin vector is a piecewise
linear function of the regularization parameter C. To prove it, we need a few preliminary
definitions. Since there is no risk of confusion, we now write $\boldsymbol{\alpha}$, $b$, and $\mathbf{w}$ instead of $\bar{\boldsymbol{\alpha}}$, $\bar{b}$, and $\bar{\mathbf{w}}$.
We introduce the sets of support vector indexes

$$I = \{ i : 0 < \alpha_i < C \} \quad \text{and} \quad J = \{ i : \alpha_i = C \},$$

and let M + 1 and E be the number of indexes in I and J respectively. The set I identifies the
M + 1 margin vectors, while J the E errors. While E can also be equal to 0, we suppose that
there are at least two margin vectors (that is, M > 0). This last hypothesis may not be satisfied
for highly degenerate configurations of points and small values of C, but does not appear to be
restrictive in cases of interest. Finally, and with no further loss of generality, we assume that all
the points are support vectors^3 and, hence, that M + 1 + E = N.
We start by sorting the support vectors so that

$$I = I^* \cup \{N\} \quad \text{and} \quad J = \{M + 1, M + 2, \ldots, N - 1\},$$

with $I^* = \{1, 2, \ldots, M\}$, and labeling the points so that $y_N = -1$. The Kuhn-Tucker conditions (13)
for $i \in I$ tell us that

$$y_i(\mathbf{w} \cdot \mathbf{x}_i + b) = 1. \qquad (18)$$

Equation (18), by means of (8) and (9), can be rewritten as

$$\sum_{j=1}^{N} \alpha_j D_{ji} + y_i b = 1. \qquad (19)$$

From the equality constraint $\sum y_i \alpha_i = 0$, instead, and since $y_N = -1$ we have

$$\alpha_N = \sum_{i=1}^{N-1} \alpha_i y_i. \qquad (20)$$

At the same time, from equation (19) with i = N we get

$$b = \sum_{j=1}^{N} \alpha_j D_{jN} - 1. \qquad (21)$$

Plugging equations (20) and (21) into (19) we obtain

$$\sum_{j=1}^{N-1} \alpha_j H_{ji} = 1 + y_i, \quad i \in I^*, \qquad (22)$$



3 This follows from the fact that if the points with $\alpha_i = 0$ are discarded, problem P4 still has the same solution.



where H is the (N - 1) × (N - 1) matrix

$$H_{ij} = y_i y_j (\mathbf{x}_i - \mathbf{x}_N) \cdot (\mathbf{x}_j - \mathbf{x}_N). \qquad (23)$$

Notice that H can be written as

$$H = \begin{pmatrix} H_M & H_{ME} \\ H_{ME}^{\top} & H_E \end{pmatrix},$$

$H_M$ being the M × M submatrix between margin vectors, $H_E$ the E × E submatrix between
errors, and $H_{ME}$ the M × E submatrix between margin vectors and errors. Separating the sum
on margin vectors and errors in equation (22), we find:

$$\sum_{j \in I^*} \alpha_j H_{ji} + C \sum_{j \in J} H_{ji} = 1 + y_i, \quad i \in I^*. \qquad (24)$$

In vector notation equation (24) rewrites

$$H_M \boldsymbol{\alpha}_M + C H_{ME} \mathbf{1}_E = \mathbf{1}_M + \mathbf{y}_M,$$

with $\boldsymbol{\alpha}_M = (\alpha_1, \alpha_2, \ldots, \alpha_M)$, $\mathbf{y}_M = (y_1, y_2, \ldots, y_M)$, and $\mathbf{1}_M$ and $\mathbf{1}_E$ the M- and E-vectors with
all the components equal to unity.

Assuming that the matrix $H_M$ is invertible (see the Appendix for a proof of this fact) we have

$$\boldsymbol{\alpha}_M = H_M^{-1} (\mathbf{1}_M + \mathbf{y}_M) - C H_M^{-1} H_{ME} \mathbf{1}_E. \qquad (25)$$



From equation (25) we infer that the Lagrange multiplier associated with a margin vector can 
always be written as the sum of two terms. As made clear by the subscript M, the first term
depends only on the margin vectors, while the second is proportional to C and depends on both
the margin vectors and errors.

An important consequence of the existence of $H_M^{-1}$ is that the vectors $\mathbf{x}_i - \mathbf{x}_N$, i = 1, 2, ..., M are
linearly independent. As a corollary, the number of margin vectors cannot exceed n + 1, that is
$M \le n$. Notice that this does not mean that the number of points lying on the margin cannot
exceed n + 1. In degenerate cases, there may be points lying on the margin with $\alpha = 0$, or even
support vectors lying on the margin with $\alpha = C$.
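
The two terms of the Lagrange multiplier of a margin vector can be computed directly from the matrix H. The following sketch does so for a hypothetical configuration in the plane with two margin vectors (M = 1) and two errors; the points, the value C = 2, and the helper names are assumptions made for this example, and the linear system is solved with a plain Gauss-Jordan elimination to keep the code self-contained.

```python
def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def solve(A, rhs):
    """Solve A x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    aug = [list(map(float, row)) + [float(r)] for row, r in zip(A, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [a - f * c for a, c in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def margin_multipliers(points, labels, M, C):
    """Return (r, g, alpha_M) with alpha_M = r + g * C, as in equation (25).

    Points are assumed sorted as in the text: the first M are margin vectors,
    the following ones are errors, and the last is a margin vector with y = -1.
    """
    xN = points[-1]
    diff = [[a - b for a, b in zip(x, xN)] for x in points[:-1]]
    m = len(diff)
    H = [[labels[i] * labels[j] * dot(diff[i], diff[j]) for j in range(m)]
         for i in range(m)]
    H_M = [row[:M] for row in H[:M]]
    r = solve(H_M, [1 + labels[i] for i in range(M)])   # H_M^-1 (1_M + y_M)
    g = solve(H_M, [-sum(H[i][M:]) for i in range(M)])  # -H_M^-1 H_ME 1_E
    return r, g, [ri + gi * C for ri, gi in zip(r, g)]

# Hypothetical configuration: margin vector, two errors, margin vector (y = -1).
points = [(1.0, 0.0), (0.5, 0.0), (-0.5, 0.5), (-1.0, 0.0)]
labels = [1, -1, 1, -1]
r, g, alpha_M = margin_multipliers(points, labels, M=1, C=2.0)
print(r, g, alpha_M)
```

For this configuration the multiplier of the first margin vector is 0.5 + 0.5 C, which lies strictly between 0 and C only when C > 1.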

3.2 Dependence on the regularization parameter 

We are now in a position to study the dependence of the OSH on the parameter C. We first
show that the normal to the OSH can be written as the sum of two orthogonal vectors. 

3.2.1 Orthogonal decomposition 

In components equation (25) can be rewritten

$$\alpha_i = r_i + g_i C, \quad i \in I^*, \qquad (26)$$

with

$$\mathbf{r}_M = H_M^{-1} (\mathbf{1}_M + \mathbf{y}_M) \qquad (27)$$

and

$$\mathbf{g}_M = -H_M^{-1} H_{ME} \mathbf{1}_E. \qquad (28)$$

Notice that the $r_i$ and $g_i$ are not necessarily positive (although they cannot be both negative).
If we define

$$r_N = \sum_{i \in I^*} r_i y_i \qquad (29)$$

$$g_N = \sum_{i \in I^*} g_i y_i + \sum_{i \in J} y_i, \qquad (30)$$

then equation (26) is also true for the margin vector of index N as

$$r_N + g_N C = \sum_{i \in I^*} r_i y_i + \sum_{i \in I^*} g_i y_i C + \sum_{i \in J} y_i C = \sum_{i \in I^*} y_i \alpha_i + C \sum_{i \in J} y_i = \alpha_N,$$

where the last equality is due to the constraint (6) and the fact that $\alpha_i = C$ for all $i \in J$.
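Plugging equation (26) into (9) and separating the constant and linear term we obtain

$$\mathbf{w} = \mathbf{w}_1 + C \mathbf{w}_2, \qquad (31)$$

with

$$\mathbf{w}_1 = \sum_{i \in I} r_i y_i \mathbf{x}_i, \qquad (32)$$

$$\mathbf{w}_2 = \sum_{i \in J} y_i \mathbf{x}_i + \sum_{i \in I} g_i y_i \mathbf{x}_i. \qquad (33)$$

It can easily be seen that $\mathbf{w}_1$ and $\mathbf{w}_2$ are orthogonal. Substituting equations (29) and (30) into
(32) and (33) respectively, one obtains

$$\mathbf{w}_1 = \sum_{i \in I^*} r_i y_i (\mathbf{x}_i - \mathbf{x}_N),$$

$$\mathbf{w}_2 = \sum_{i \in J} y_i (\mathbf{x}_i - \mathbf{x}_N) + \sum_{i \in I^*} g_i y_i (\mathbf{x}_i - \mathbf{x}_N).$$

Then, through the definition of $H_M$ and $H_{ME}$ we have

$$\mathbf{w}_1 \cdot \mathbf{w}_2 = \mathbf{r}_M \cdot H_{ME} \mathbf{1}_E + \mathbf{r}_M \cdot H_M \mathbf{g}_M. \qquad (34)$$

Plugging equation (28) in (34) it follows immediately that $\mathbf{w}_1 \cdot \mathbf{w}_2 = 0$.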
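
The orthogonality of the two terms can be checked numerically on a small case. The sketch below builds $\mathbf{w}_1$ and $\mathbf{w}_2$ from equations (29), (30), (32) and (33) for a hypothetical planar configuration with M = 1; the values r = (0.5) and g = (0.5) are assumed to have been obtained beforehand from equations (27) and (28) for these points, and all names are introduced only for this example.

```python
def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def decompose(points, labels, r, g, M):
    """Return (w1, w2) of equations (32) and (33).

    Points are assumed sorted as in the text: indexes 0..M-1 form I*,
    indexes M..N-2 form J, and the last point is the margin vector N (y = -1).
    r and g hold the coefficients of the margin vectors in I*.
    """
    N, n = len(points), len(points[0])
    I_star, J = range(M), range(M, N - 1)
    # Equations (29) and (30): coefficients of the margin vector of index N.
    r_N = sum(r[i] * labels[i] for i in I_star)
    g_N = sum(g[i] * labels[i] for i in I_star) + sum(labels[j] for j in J)
    I = list(I_star) + [N - 1]
    r_all, g_all = list(r) + [r_N], list(g) + [g_N]
    w1 = [sum(rv * labels[i] * points[i][k] for rv, i in zip(r_all, I))
          for k in range(n)]
    w2 = [sum(labels[j] * points[j][k] for j in J)
          + sum(gv * labels[i] * points[i][k] for gv, i in zip(g_all, I))
          for k in range(n)]
    return w1, w2

# Hypothetical configuration: margin vector, two errors, margin vector (y = -1).
points = [(1.0, 0.0), (0.5, 0.0), (-0.5, 0.5), (-1.0, 0.0)]
labels = [1, -1, 1, -1]
w1, w2 = decompose(points, labels, r=[0.5], g=[0.5], M=1)

C = 2.0
w = [a + C * b for a, b in zip(w1, w2)]   # equation (31)
print(w1, w2, dot(w1, w2), w)
```

In this example the normal vector at C = 2 is (1, 1): the component along $\mathbf{w}_1$ is fixed by the two margin vectors, while the component along $\mathbf{w}_2$ grows linearly with C.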

3.2.2 Changing the regularization parameter 

We now study the effect of small changes of the regularization parameter C on the OSH. Since C 
is the only free parameter of SVMs, this study is relevant from both the theoretical and practical 
viewpoint. In what follows we let C take on values over the positive real axis IR + . First, we 
notice that the possible choices of support vectors for all possible values of C (distinguishing 
between margin vectors and errors) are finite. If we neglect degenerate configurations of support 
vectors, this implies that $\mathbb{R}^+$ can be partitioned in a finite number of disjoint intervals, each
characterized by a fixed set of support vectors. Notice that the rightmost interval is necessarily 
unbounded. 




After this preliminary observation we can already conclude that, with the exception of the C 
values corresponding to the interval ends, the set of support vectors does not vary for small 
changes of C . But through the previous analysis we can also study the dependence of the normal 
vector w on the parameter C. From equation (31) it follows that if C changes by $\delta C$ and the
margin vectors and errors remain the same, the normal vector w changes by $\delta C\, \mathbf{w}_2$ along the
direction of $\mathbf{w}_2$. We can make this statement more precise distinguishing between two cases.
In the first case we let M reach the maximum value n. Since $H_M$ has always maximum rank, we
have n + 1 independent Kuhn-Tucker conditions like equation (18) and the OSH is completely
determined by the n + 1 margin vectors. Consequently, since for almost all C the set of support
vectors remains the same for small changes of C, $\mathbf{w}_2$ must vanish and we have

$$\mathbf{w} = \sum_{i \in I} r_i y_i \mathbf{x}_i. \qquad (35)$$

Equation (35) tells us that if M = n the OSH is fixed and unambiguously identified by the n + 1
margin vectors. The fact that the OSH is fixed makes it possible to determine the maximum
interval around C, say $(C_1, C_2]$, in which the OSH is given by equation (35). To this purpose it
is sufficient to compute the $r_i$ and $g_i$ from equations (27) and (28) and find $C_1$ and $C_2$ as the
minimum and maximum C for which the $\alpha_i$ associated with the margin vector $\mathbf{x}_i$ satisfy the
constraint $0 < \alpha_i < C$.

In the second case, we have M < n. The OSH is now given by equation (31) with $\mathbf{w}_2 \ne 0$. Thus
for a small change $\delta C$ the new OSH w' can be written as

$$\mathbf{w}' = \mathbf{w} + \delta C\, \mathbf{w}_2. \qquad (36)$$

Equation (36) tells us that if M < n the OSH changes by an amount $\delta C\, \mathbf{w}_2$. Here again there exists
a maximum interval $(C_1, C_2]$ around C in which the OSH is given by equation (36). Similarly to
the previous case, one could determine the minimum and maximum C for which the $\alpha_i$ associated
with the margin vectors satisfy the constraint $0 < \alpha_i < C$. However, since to a changing OSH
might correspond a new set of support vectors, these minimum and maximum values are only a
lower and upper bound for $C_1$ and $C_2$ respectively.

Finally, we observe that even if M < n, the OSH can always be written as a linear combination
of n + 1 support vectors, for example by adding n + 1 - M errors.
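
The range of C over which the multipliers of the margin vectors stay strictly between 0 and C follows from elementary inequalities on $r_i + g_i C$. The sketch below computes that range; the function name and the numerical values of r and g are hypothetical, the lists are assumed to include the coefficients $r_N$ and $g_N$ of the margin vector of index N, and, as remarked above, for M < n the result only bounds the interval $(C_1, C_2]$.

```python
import math

def c_interval(r, g):
    """Largest open interval (lo, hi) of C > 0 on which 0 < r_i + g_i * C < C
    holds for every margin vector, or None if no such C exists."""
    lo, hi = 0.0, math.inf
    for ri, gi in zip(r, g):
        # Lower constraint: r_i + g_i * C > 0.
        if gi > 0:
            lo = max(lo, -ri / gi)
        elif gi < 0:
            hi = min(hi, -ri / gi)
        elif ri <= 0:
            return None
        # Upper constraint: r_i + g_i * C < C, that is r_i < (1 - g_i) * C.
        if gi < 1:
            lo = max(lo, ri / (1 - gi))
        elif gi > 1:
            hi = min(hi, ri / (1 - gi))
        elif ri >= 0:
            return None
    return (lo, hi) if lo < hi else None

# Hypothetical coefficients of two margin vectors (index 1 and index N).
print(c_interval([0.5, 0.5], [0.5, 0.5]))
```

With these coefficients every C larger than 1 is admissible, which corresponds to an unbounded rightmost interval.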

3.2.3 A numerical example 

We now illustrate both cases by means of the numerical example with n = 2 shown in figure
3. Figure 3(a) shows the OSH found for the displayed training set with C = 4.0. The support
vectors are denoted by the filled circles and triangles (the margin vectors in black, the errors
in grey). In accordance with equation (35), since there are 3 margin vectors the OSH is fixed.
Straightforward computations predict that the OSH must remain the same for 2.7 < C ≤ 4.5.
This prediction has been verified numerically.

Figure 3(b) shows the new OSH obtained for C just outside the interval (2.7, 4.5] (C = 4.8).
Notice that the errors are the same as in figure 3(a), while there are only two margin vectors. As
we have just discussed, the OSH should now change for small variations of C as predicted by
equation (36). This has been verified numerically and figure 3(c) displays the OSHs obtained
from equation (36) and from direct solution of the problem P4 for C = 6.7. The two OSHs
coincide within numerical precision.

Figure 3: Optimal separating hyperplane for C = 4.0 (a), C = 4.8 (b), C = 6.7 (c), and C = 7.5
(d) respectively. Legend as in figure 2.

For a larger variation of C (C > 7.0, see figure 3(d)) the number of margin vectors goes back to 
3 and the solution is again fixed. Notice that in this last transition one of the errors became a 
margin vector (the error in the upper part of the margin strip of figure 3(c) is a margin vector 
in figure 3(d)). 

As mentioned in the previous section, it is worthwhile noticing that the solutions with smaller C 
(see figure 3(a) and (b)) have a larger margin, while the solutions with larger C (see figure 3(c) 
and (d)) have a smaller number of errors. 

3.3 Extension to nonlinear kernels 

We now extend the presented analysis to the case of nonlinear kernels. 

Lagrange multiplier of a margin vector. We start by observing that the same decomposition
of the Lagrange multiplier of a margin vector derived in the linear case holds true for nonlinear
kernels. Note that the matrix H of equation (23) rewrites

$$H_{ij} = y_i y_j \left( K(\mathbf{x}_i, \mathbf{x}_j) - K(\mathbf{x}_j, \mathbf{x}_N) - K(\mathbf{x}_i, \mathbf{x}_N) + K(\mathbf{x}_N, \mathbf{x}_N) \right) \qquad (37)$$

while equations (25) to (30) remain unchanged.
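
Equation (37) only requires kernel evaluations, so the matrix H can be assembled without any explicit feature map. The sketch below is a minimal illustration on a hypothetical four-point set; with the linear kernel it reproduces the matrix of equation (23), and the Gaussian kernel (with σ = 1, an arbitrary choice) is included to show that the same function applies to infinite kernels.

```python
import math

def linear_kernel(x, z):
    """Plain dot product in input space."""
    return sum(a * b for a, b in zip(x, z))

def gaussian_kernel(x, z, sigma=1.0):
    """exp(-||x - z||^2 / (2 sigma^2))."""
    sq = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq / (2.0 * sigma ** 2))

def kernel_H(points, labels, kernel):
    """Matrix H of equation (37); the last point plays the role of x_N."""
    xN = points[-1]
    m = len(points) - 1
    k_NN = kernel(xN, xN)
    return [[labels[i] * labels[j]
             * (kernel(points[i], points[j]) - kernel(points[j], xN)
                - kernel(points[i], xN) + k_NN)
             for j in range(m)] for i in range(m)]

# Hypothetical configuration: margin vector, two errors, margin vector (y = -1).
points = [(1.0, 0.0), (0.5, 0.0), (-0.5, 0.5), (-1.0, 0.0)]
labels = [1, -1, 1, -1]

H_lin = kernel_H(points, labels, linear_kernel)
H_rbf = kernel_H(points, labels, gaussian_kernel)
print(H_lin)
```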
Orthogonal decomposition. More care is needed for the extension of the orthogonal decomposition
of w and the study of the behavior of the separating surface as a function of C. This
is because, in the nonlinear case, it may not be possible to recover an explicit expression for
w. However, this does not pose major problems because all the expressions involving w are
effectively dot products between feature points and can be computed by means of the kernel K.
Indeed, if we take the dot product between w and $\varphi(\mathbf{x})$, we obtain

$$\mathbf{w} \cdot \varphi(\mathbf{x}) = \sum_{i=1}^{N} \alpha_i y_i K(\mathbf{x}_i, \mathbf{x}),$$

that can be written as

$$\mathbf{w} \cdot \varphi(\mathbf{x}) = \sum_{i \in I} r_i y_i K(\mathbf{x}_i, \mathbf{x}) + C \left( \sum_{j \in J} y_j K(\mathbf{x}_j, \mathbf{x}) + \sum_{i \in I} g_i y_i K(\mathbf{x}_i, \mathbf{x}) \right). \qquad (38)$$

The two terms in the r.h.s. of equation (38) are the counterparts of equations (32) and (33)
defining $\mathbf{w}_1$ and $\mathbf{w}_2$ respectively. Note that even if the explicit expression for $\mathbf{w}_1$ and $\mathbf{w}_2$ cannot
be given, the orthogonality relation (34) remains true. This can be seen from the fact that the
r.h.s. of equation (34) depends on the matrix H which, in the nonlinear case, is rewritten as in
equation (37). In this respect, the two terms in the r.h.s. of equation (38) can be regarded as
orthogonal.

Changing the regularization parameter. So far, all the results derived in the linear case
carry through to the case of nonlinear kernels. For the dependence of the separating surface on
the parameter C, instead, it is convenient to distinguish between finite and infinite kernels. 
For finite kernels, all the results obtained in the linear case are still valid and can be rederived 
simply replacing n, dimension of input space, with m, dimension of feature space. For example, 
if M = m, the OSH in feature space does not change for small changes of C and the second term 
in the r.h.s. of equation (38) vanishes for all x. Furthermore, the interval $(C_1, C_2]$, within which
the OSH is fixed, can be determined exactly as in the linear case.

For kernels of infinite dimension, instead, a finite number of margin vectors is not sufficient to
fully determine the OSH. Consequently and differently from the finite case, the OSH is never
fixed and the second term of equation (38) does not vanish. For a small change $\delta C$, the dot
product $\mathbf{w} \cdot \varphi(\mathbf{x})$ changes by the amount

$$\delta C \left( \sum_{j \in J} y_j K(\mathbf{x}_j, \mathbf{x}) + \sum_{i \in I} g_i y_i K(\mathbf{x}_i, \mathbf{x}) \right).$$

In summary, all the results derived in the linear case can be extended without major changes 
to the nonlinear case, with the exception of the properties depending on the finiteness of the
dimension of the linear case, like the upper bound on the number of margin vectors, properties 
that are still true for finite kernels only. 

4 Conclusions 

In the case of pattern recognition, SVMs depend on only one free parameter, the regularization
parameter C. In this paper we have discussed some mathematical properties of support vectors
useful to characterize the behavior of the decision surface with respect to C. We have identified
a special subset of support vectors, the margin vectors, whose Lagrange multipliers are strictly
smaller than the regularization parameter C. We have shown that the margin vectors are always
linearly independent and that the decision surface can be written as the sum of two orthogonal
terms, the first depending only on the margin vectors, the second proportional to the regularization
parameter. For almost all values of the parameter, this enabled us to predict how the
decision surface varies for small parameter changes. In general we found that the solution is
usually stable with respect to small changes of C.

The obtained results can be more conveniently summarized distinguishing between finite and
infinite kernels. For kernels of finite dimension m, it turned out that m + 1 is the least upper
bound for the number of margin vectors (M + 1) and the behavior of the OSH as a function of
C depends on whether M = m or M < m. If M = m, the M + 1 margin vectors are sufficient
to fully determine the equation of the OSH in feature space and for almost all values of C the
OSH does not vary for small changes of C. If M < m, instead, the OSH varies by an amount
proportional to the change $\delta C$ in a direction identified by both the margin vectors and errors.
In both cases it is worthwhile observing that the number of support vectors effectively needed
to identify the decision surface is never greater than m + 1. This latter result may be useful to
reduce the number of support vectors effectively needed to perform recognition.
For infinite kernels, the margin vectors are still linearly independent but there is no upper bound
on their number. For small changes of C the OSH is not fixed and varies as in the case M < m
of finite kernels.

Acknowledgements. Edgar Osuna read the manuscript and made useful remarks. This work 
has been partially supported by a grant from the Agenzia Spaziale Italiana. 

Appendix 

In this appendix we sketch the proof of the existence of $H_M^{-1}$. First, we need to (a) transform
the original dual problem P4 into a Linear Complementarity Problem (LCP), and (b) derive the
explicit expression for the matrix G which defines the polyhedral set on which the solution of
the LCP lies.

Let us define $\boldsymbol{\alpha} = (\alpha_1, \alpha_2, \ldots, \alpha_{N-1})$ and recall that $\alpha_N = \sum y_i \alpha_i$ where the sum ranges over
i = 1, 2, ..., N - 1. We let $N_1$ and $N_2$ be the number of points with positive and negative labels
respectively. We start by rewriting problem P4 without the equality constraint as

Problem P5

$$\begin{aligned} \text{Minimize} \quad & \tfrac{1}{2}\, \boldsymbol{\alpha} \cdot H \boldsymbol{\alpha} - 2 \sum_{i \in I_+} \alpha_i \\ \text{subject to} \quad & -\sum_{i=1}^{N-1} y_i \alpha_i \le 0, \quad \sum_{i=1}^{N-1} y_i \alpha_i \le C \\ & \alpha_i \le C, \quad i = 1, 2, \ldots, N - 1 \\ & \alpha_i \ge 0, \quad i = 1, 2, \ldots, N - 1 \end{aligned}$$

with $I_+$ the set of indexes corresponding to the $\alpha_i$ for which $y_i = 1$. Then, we let $u_+$, $u_-$,
$\mathbf{u} = (u_1, u_2, \ldots, u_{N-1})$, and $\mathbf{v} = (v_1, v_2, \ldots, v_{N-1})$ be the 2N Lagrange multipliers associated with
the constraints of problem P5 respectively.

The LCP associated with problem P5 is obtained by

1. setting equal to 0 the gradient of the Lagrangian associated with problem P5, or

$$\sum_{j=1}^{N-1} \alpha_j H_{ji} - 1 + y_i (u_+ - u_-) - y_i + u_i - v_i = 0,$$

and

2. introducing the N + 1 slack variables^4 $s_+$, $s_-$, and $\mathbf{s} = (s_1, s_2, \ldots, s_{N-1})$, satisfying

$$s_+ + \sum_{i=1}^{N-1} \alpha_i y_i = C, \qquad s_- - \sum_{i=1}^{N-1} \alpha_i y_i = 0, \qquad \text{and} \qquad s_i + \alpha_i = C,$$

along with the associated complementarity conditions

$$s_- u_- = s_+ u_+ = 0, \qquad s_i u_i = 0, \qquad \text{and} \qquad \alpha_i v_i = 0,$$

for each i = 1, 2, ..., N - 1.



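The solution of problem P5 can be obtained as the solution of the LCP

Problem P6

$$\begin{aligned} \text{Solve} \quad & \mathbf{t} - M \mathbf{z} = \mathbf{q} \\ \text{subject to} \quad & \mathbf{t}, \mathbf{z} \ge 0, \quad t_i z_i = 0, \quad i = 1, 2, \ldots, 2N, \end{aligned}$$

with $\mathbf{t} = (s_-, s_+, \mathbf{s}, \mathbf{v})$, $\mathbf{z} = (u_-, u_+, \mathbf{u}, \boldsymbol{\alpha})$,

$$M = \begin{pmatrix} 0 & -A \\ A^{\top} & H \end{pmatrix}, \qquad A = \begin{pmatrix} -y_1 & \cdots & -y_{N-1} \\ y_1 & \cdots & y_{N-1} \\ & I_{N-1} & \end{pmatrix}, \qquad \mathbf{q} = (\mathbf{b}, \mathbf{k}),$$

$\mathbf{b} = (0, C, \ldots, C)$ with N + 1 components, and $\mathbf{k} = -(1 + y_1, 1 + y_2, \ldots, 1 + y_{N-1})$.

4 In the constrained optimization jargon, a slack variable is a nonnegative variable that turns an inequality
into an equality constraint.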

Similarly to the case of linear programming, a solution to Problem P6 is a vertex of a polyhedral
set. In addition, the solution must also satisfy the complementarity conditions. In
the case of problem P6, a solution vector $\mathbf{p} = (\mathbf{t}, \mathbf{z})$ is a vertex of the polyhedral set S =
$\{\mathbf{p} : G\mathbf{p} = \mathbf{q},\ \mathbf{p} \ge 0\}$, with $G = [I_{2N}, -M]$, $\mathbf{p} = (\mathbf{p}_B, \mathbf{p}_N)$, $\mathbf{p}_B = B^{-1}\mathbf{q}$, $\mathbf{p}_N = 0$, and B is the
2N × 2N matrix defined by the columns of G corresponding to the 2N active variables.
Through simple but lengthy calculations, it can be seen that the matrix $H_M$ is a submatrix of B
and $H_M^{-1}$ a submatrix of $B^{-1}$. The existence of $H_M^{-1}$ is thus ensured by the existence of $B^{-1}$.